Skip to content

fix: \G regex anchor — respect pos() in all match contexts#469

Merged
fglock merged 4 commits into
masterfrom
fix/perl-tidy-tests
Apr 9, 2026
Merged

fix: \G regex anchor — respect pos() in all match contexts#469
fglock merged 4 commits into
masterfrom
fix/perl-tidy-tests

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 9, 2026

Summary

Two fixes for the \G assertion in the regex engine, discovered while investigating Perl::Tidy test failures:

  1. \G when pos() is undef: Previously, the \G anchor check was skipped when pos() was undefined, allowing \G(\s+) to scan forward and match at any position. Now \G anchors at position 0 when pos is undef, matching Perl behavior.

  2. \G in non-/g matches: Previously, pos() was only consulted for /g matches. \G in non-/g matches (e.g. $str =~ /\Gfoo/) always started from position 0 regardless of pos(). Now pos() is consulted whenever \G is present in the pattern.

Impact on Perl::Tidy

  • Fix 1 corrects option parsing — parse_args uses \G/gc patterns to tokenize option strings. Options like -dac were silently dropped.
  • Fix 2 corrects the tokenizer's signature detection — $input_line =~ /\G\s*\(/ (non-/g) was always matching at pos 0 instead of the current position.

Perl::Tidy test results: 5/44 → 7/44 passing files. New passes: atee.t, filter_example.t, test_DEBUG.t.

Remaining 37 failures are all DESTROY-related (singleton counter not decremented).

Test plan

  • make passes (all unit tests including new \G regression tests)
  • New regression tests added to regex_g_pos.t:
    • \G anchoring when pos() is undef
    • \G in non-/g matches at various positions
    • \G/gc tokenizer simulation (parse_args pattern)
  • Verified filter_example.t (signature detection) now passes
  • Verified atee.t (option parsing) now passes
  • Updated plan doc with fix details and current status

Generated with Devin

fglock and others added 4 commits April 9, 2026 10:44
Investigation of ./jcpan -t Perl::Tidy (v20260204) identified 5 blockers:

1. DESTROY singleton (Critical): Formatter and Tokenizer use closure
   counters decremented in DESTROY. Since PerlOnJava does not call DESTROY,
   the 2nd+ perltidy() call per process dies. Affects 36/44 test files
   (~555 subtests). Fix: 2-line overlay in Perl/Tidy.pm.

2. Option parsing (Moderate): perltidyrc string ref options (-dac, -bl)
   not applied. Possibly Getopt::Long negatable boolean handling.

3. Wide char alignment (Low): Unicode display width miscalculation.

4. EOL handling (Low): t/test-eol.t produces no output.

5. DEBUG output (Low): debugfile scalar ref returns undef.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two fixes for the \G assertion in the regex engine:

1. \G when pos() is undef: Previously, \G was only checked when pos()
   was defined (isPosDefined). When pos() was undef, the \G anchor
   check was skipped entirely, allowing the regex to scan forward and
   match at any position. Now \G always anchors at startPos (which
   defaults to 0 when pos() is undef), matching Perl behavior.

2. \G in non-/g matches: Previously, pos() was only looked up for /g
   matches. \G in non-/g matches (e.g. $str =~ /\Gfoo/) always
   started from position 0 regardless of pos(). Now pos() is looked
   up whenever \G is present in the pattern, matching Perl behavior
   where \G anchors at pos() even without /g.

These fixes are critical for Perl::Tidy compatibility:
- Fix 1 corrects option parsing (parse_args uses \G/gc tokenizer)
- Fix 2 corrects the tokenizer signature detection (\G in non-/g)

Perl::Tidy test results improve from 5/44 to 7/44 passing files.
All remaining failures are DESTROY-related (singleton counter).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Updated plan doc to reflect the two \G regex fixes applied:
1. \G when pos() is undef - anchors at 0 instead of scanning forward
2. \G in non-/g matches - now respects pos() like Perl does

Documented remaining DESTROY blocker as the sole remaining issue
blocking 33+ test files (snippet tests, wide char, EOL).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The -T heuristic was too strict for UTF-8 files. It treated all bytes
>= 128 as non-text, requiring > 70% printable ASCII to classify as text.

Files with significant UTF-8 content (e.g. Cyrillic, Polish, German
umlauts) were misclassified as binary, causing Perl::Tidy to skip them
with "Non-text (override with -f)".

Now matches Perl pp_fttext heuristic:
- Valid UTF-8 multi-byte sequences are treated as text (not odd)
- Invalid high-bit bytes are counted as odd
- Binary if odd * 3 > length (same 1/3 threshold as Perl)

Fixes testwide-passthrough.t and testwide-tidy.t file-to-file tests.
All 48 subtests that run now pass (0 failures). Remaining 37/44 test
file failures are purely DESTROY-related.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock mentioned this pull request Apr 9, 2026
3 tasks
@fglock fglock merged commit 60ae63b into master Apr 9, 2026
2 checks passed
@fglock fglock deleted the fix/perl-tidy-tests branch April 9, 2026 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant